1. Reading and preprocessing
  2. Finding correlations in data
  3. Taking care of outliers in data
  4. Exploring the data

Reading and preprocessing

Reading raw data

General raw data characteristics: there are 35 attributes, among them one classification target attribute: “Attrition” (in Polish: “Wypalenie”, i.e. burnout). The data serves the purpose of finding out whether a given employee is completely fed up with his work and cannot stand it anymore (positive value) or is fine (negative value).

Data set attributes description:

  • 1470 instances (+237, -1233)
  • 18 Qualitative
    • 4 Nominal
    • 4 Binary (counting target attribute)
    • 10 Ordinal
  • 17 Quantitative
    • 15 Discrete
    • 2 Continuous

Additionally, no missing values were found. If there had been any, we would have used a strict approach and removed the instances with missing entries, unless another tactic proved necessary, e.g. if we would otherwise have to delete a considerably large part of the data set.
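The strict policy can be sketched on a hypothetical toy data frame (not the actual HR data; the column names merely echo the real data set), using only base R:

# Toy data frame with deliberate gaps to illustrate the missing-value check
toy = data.frame(Age = c(41, NA, 37), MonthlyIncome = c(5993, 5130, NA))

colSums(is.na(toy))      # NAs per column: Age 1, MonthlyIncome 1
cleaned = na.omit(toy)   # strict policy: drop any row with a missing entry
nrow(cleaned)            # 1 complete row survives

On the real data, colSums(is.na(raw_data)) returning all zeros is what confirms the "no missing values" claim above.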

# Reading raw data and entry preprocessing

raw_data = read.csv("data/raw_data.csv")
raw_data = na.omit(raw_data) # strict policy for missing values

# Reshaping raw data and saving it to inspect in Weka (moving target attribute at the end)

raw_weka_data = raw_data %>% relocate(Attrition, .after=YearsWithCurrManager)
write.csv(raw_weka_data, "data/raw_weka_data.csv", row.names = FALSE)

After inspecting each attribute in the Weka application, we identified those that can be viewed as irrelevant or useless, specifically:

  • EmployeeCount - Discrete attribute with only one value for all instances
  • EmployeeNumber - Discrete attribute with different value for each instance
  • Over18 - Binary attribute with the same value for all instances
  • StandardHours - Discrete attribute with the same value for all instances

Chart interpretation notes:

  • black dotted line stands for the mean value
  • green dotted lines stand for the mean +/- 3 * standard deviation

Data preprocessing

The preprocessing stage begins with the removal of the previously mentioned irrelevant attributes. To make the data more human-readable, we also decided to rename the values of some ordinal attributes from categorical (but numeric) codes to their actual meanings. This procedure concerns:

  • Education
  • EnvironmentSatisfaction
  • JobInvolvement
  • JobSatisfaction
  • PerformanceRating
  • RelationshipSatisfaction
  • WorkLifeBalance

Additionally, some simplification of the data set column names was possible, so we renamed:

  • the attribute “ï..Age” to simply “Age” (the “ï..” prefix is an encoding artifact, most likely a UTF-8 byte-order mark)

# Dropping irrelevant attributes

preprocessed_data = raw_weka_data %>%
  select(-EmployeeCount) %>%
  select(-EmployeeNumber) %>%
  select(-Over18) %>%
  select(-StandardHours)

# Recoding the numeric values of some ordinal attributes to their actual meanings

preprocessed_data = preprocessed_data %>% 
  mutate(Education=recode(Education,
                                 `1` = 'Below College',
                                 `2` = 'College',
                                 `3` = 'Bachelor',
                                 `4` = 'Master',
                                 `5` = 'Doctor')) %>%
  mutate(EnvironmentSatisfaction = recode(EnvironmentSatisfaction,
                                 `1` = 'Low',
                                 `2` = 'Medium',
                                 `3` = 'High',
                                 `4` = 'Very High')) %>%
  mutate(JobInvolvement = recode(JobInvolvement,
                                 `1` = 'Low',
                                 `2` = 'Medium',
                                 `3` = 'High',
                                 `4` = 'Very High')) %>%
  mutate(JobSatisfaction = recode(JobSatisfaction,
                                 `1` = 'Low',
                                 `2` = 'Medium',
                                 `3` = 'High',
                                 `4` = 'Very High')) %>%
  mutate(PerformanceRating = recode(PerformanceRating,
                                 `1` = 'Low',
                                 `2` = 'Good',
                                 `3` = 'Excellent',
                                 `4` = 'Outstanding')) %>%
  mutate(RelationshipSatisfaction = recode(RelationshipSatisfaction,
                                 `1` = 'Low',
                                 `2` = 'Medium',
                                 `3` = 'High',
                                 `4` = 'Very High')) %>%
  mutate(WorkLifeBalance = recode(WorkLifeBalance,
                                 `1` = 'Bad',
                                 `2` = 'Good',
                                 `3` = 'Better',
                                 `4` = 'Best'))

# Renaming "ï..Age" attribute to just "Age", the rest is fine

preprocessed_data = rename(preprocessed_data, Age = "ï..Age")

# Saving early preprocessed data set

write.csv(preprocessed_data, "data/preprocessed_data.csv", row.names = FALSE)

Data set attributes after cleaning (31 attributes, among them 1 target attribute):

  • 1470 instances (+237, -1233)
  • 17 Qualitative
    • 4 Nominal
    • 3 Binary (counting target attribute) (-1)
    • 10 Ordinal
  • 14 Quantitative
    • 12 Discrete (-3)
    • 2 Continuous

Inspecting (only) some exemplary attributes in the data set

Finding correlations in data

Let’s start by looking at a heatmap of correlations (of course, only numeric attributes and the target are taken into consideration).

And it looks bad.

The colder the color, the lower the correlation, so almost all numeric attributes have no correlation with Attrition. We can observe some correlations, like Age with TotalWorkingYears (shocking) or JobLevel with MonthlyIncome (thrilling). This shows how much effort whoever created this data set put into making it as hard to analyze as possible.
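For reference, a minimal sketch of how such a heatmap can be produced in base R. The toy columns below are hypothetical stand-ins for Age, TotalWorkingYears and MonthlyIncome; on the real data one would first select the numeric columns, e.g. preprocessed_data %>% select(where(is.numeric)).

# cor() requires numeric columns only
toy = data.frame(Age               = c(25, 35, 45, 55),
                 TotalWorkingYears = c(3, 12, 20, 30),
                 MonthlyIncome     = c(2000, 5000, 9000, 4000))

corr_matrix = cor(toy)
corr_matrix["Age", "TotalWorkingYears"]  # close to 1: strongly correlated

# Base R heatmap; a package like corrplot gives prettier output
heatmap(corr_matrix, symm = TRUE)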

Taking care of outliers in data

For detecting outliers we chose the IQR method. The data is created in such a way that deleting outlying instances has no impact on the correlations, so to avoid losing more data we replace outlier values with the mean value of the given attribute.
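A minimal sketch of this replacement rule; the 1.5 * IQR fences are an assumption, since the exact multiplier is not stated above.

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
# and replaced with the attribute's mean (computed before replacement)
replace_outliers_iqr = function(x) {
  q     = quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence = 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence] = mean(x)
  x
}

# Hypothetical toy vector: 100 lies far outside the fences and becomes
# the mean of the vector, 21.6
replace_outliers_iqr(c(1, 2, 3, 2, 100))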

We look for outliers in only a few selected attributes.

We can observe that there are fewer outliers, but they still exist. Once again, the data is constructed in such a way that we cannot remove all outliers without losing a big chunk of it.

Now let’s look at MonthlyIncome.

We can observe that detecting outliers in the MonthlyIncome attribute is pointless, so we simply do not apply outlier handling to it, or to similar attributes, in the final data set.

Exploring the data

Despite the data set being so hard to analyze, we can at least try to find something interesting. To begin, let’s look at how time spent in one position influences Attrition. We can clearly observe three spikes. The first: newcomers to a given position. The second: a big chunk of people who have held their position for around 2 years; this group will probably shrink rapidly, because some of them move up and some quit, reaching a low point at the 5-year mark. The last spike is people who have worked 7 years in their position; they are probably satisfied with it. We can observe that Attrition is independent of these groups, so we deduce that this attribute does not influence Attrition. (We analyzed about 10 different attributes in the same manner, and all of them lead to the same conclusion.)

Let’s check how time working under the same manager, working overtime (crunch) and income influence our target attribute. The first thing that comes to mind is that the distribution of overtime work among people who declare burnout is much more balanced than among the others. Another thing we can observe is that there are three groups of people sharing the same time worked under the current manager: once again 0, 2 and 7 years.

Finally, let’s check how Attrition looks when we compare the average working years per company (calculated from TotalWorkingYears and NumCompaniesWorked) and years at the company, divided among people with different marital status and different satisfaction with the work environment. People who are single, have low satisfaction with the work environment, have worked only a short time at the given company, and have a history of short stays at other companies declare themselves burned out.
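The derived “average working years per company” feature can be sketched as follows; the guard against NumCompaniesWorked being 0 is our assumption about how the division was handled, since the text does not say.

# Average tenure per employer; NumCompaniesWorked can be 0 in this data set
# (current company only), so we clamp the divisor to at least 1
avg_working_years = function(total_working_years, num_companies_worked) {
  total_working_years / pmax(num_companies_worked, 1)
}

avg_working_years(c(10, 6), c(5, 0))  # 2 years per company; 6 for a single employer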

Thinking about it, we quickly come to the conclusion that these four symptoms are features of impatient people. And this only builds up our appreciation for whoever created this data set for practicing Data Science skills.

Source: Kaggle